N89-19826 


Knowledge-Based Operation and Management 
of Communications Systems* 


Harold M. Heggestad 
M.I.T. Lincoln Laboratory 
244 Wood Street 
Lexington, MA 02173 


ABSTRACT 

Expert systems techniques are being 
applied in operation and control of the 
Defense Communications System (DCS), 
which has the mission of providing 
reliable worldwide voice, data and 
message services for U.S. forces and 
commands. Thousands of personnel 
operate DCS facilities, and many of their 
functions match the classical expert 
system scenario: complex, skill-intensive 
environments with a full 
spectrum of problems in training and 
retention, cost containment, moderni- 
zation, and so on. Two of these 
functions have been the subject of 
research programs at Lincoln Laboratory 
over the past two years, sponsored by 
Rome Air Development Center and the 
Defense Communications Agency respect- 
ively, namely 1) fault isolation and 
restoral of dedicated circuits at Tech 
Control Centers and 2) network manage- 
ment for the Defense Switched Network 
(the modernized dial-up voice system 
currently replacing AUTOVON). An expert 
system for the first of these is deployed 
for evaluation purposes at Andrews Air 
Force Base, and plans are being made for 
procurement of operational systems. In 
the second area, knowledge obtained with 
a sophisticated simulator is being 
embedded in an expert system. The 
background, design and status of both 
projects will be described. 

1. INTRODUCTION 

In order to maintain peak per- 
formance despite the fact that no 
electronic equipment can run indefi- 
nitely without degradation or failure, 
all communication systems must provide 
for detecting and correcting deficient 
operation. The growing technical 
disciplines of Network Management and of 
"AO&M" (Administration, Operation and 
Maintenance) in the telecommunications 
industry reflect the substantial payoffs 
that can be obtained by prompt and 
careful attention to these factors, 
keeping network performance and revenues 


continually close to peak design 
capabilities. A number of efforts have 
been undertaken to provide automated aids 
for human operators carrying out these 
functions, and in recent years artificial 
intelligence techniques have been pursued 
in the attempt to achieve consistently 
high performance despite operator skill 
and experience limitations [1]. 

For military communication systems, 
the motivations for continually main- 
taining high network performance are 
slightly different. For one thing, 
chronically tight defense budgets tend to 
limit communications expenditures to the 
bare minimum, and adequate support of 
military requirements can only be 
achieved if these minimum systems can be 
kept tuned to their peak capability. 
Another significant difference is that 
military communication systems are 
precedence-oriented: in times of 
emergency or network damage the best 
achievable service must be provided to 
the most critical users, even if this 
requires preemption or denial of service 
to less essential users. Moreover, 
military operators and technicians tend 
to be young and inexperienced compared to 
their civilian counterparts; this 
increases the risk that military networks 
may not be at their best, and creates 
even greater need for automated aid 
systems. 

The purpose of this paper is to 
briefly describe two ongoing projects 
which are developing Expert 
Systems techniques for assisting 
military personnel in maintaining peak 
performance of military voice and data 
communications systems [2,3,4,5]. One 
project addresses Technical Control, 
which is the process of isolating faults 
and restoring service on critical 
dedicated circuits. The other project 
addresses Network Management for the 
worldwide voice system called the Defense 
Switched Network; this is the process of 
allocating and controlling network 
resources to assure reliable service for 
high-precedence dial-up users, despite 
failures or congestion in major portions 
of the network. 

2. THE EXPERT TECH CONTROLLER 

Work has been in progress since 
FY86 on the Expert Tech Controller (or 
ETC), an expert system for assisting 
humans in performing circuit fault 
isolation and service restoral at 
military Tech Control Facilities (TCFs) 
[2,3,4]. A worldwide network of some 
400 TCFs performs these functions for 
more than 61,000 dedicated circuits 
operated by the Department of Defense 
(DoD) for facilities or users who must 
have full-time connectivity because they 
are continually active, or because their 
mission requires instant communications 
in the event of emergency. Such circuits 
include, for example, high-usage long- 
haul trunks for the military switched 
voice network; dedicated circuits linking 
the Pentagon and the White House with 
military force commanders; and data links 
joining various essential computer 
systems. 

ETC was initially implemented with 
ART (trademarked acronym for Automated 
Reasoning Tool, an Expert System shell 
produced by Inference Corp.), on a 
Symbolics 3640 computer. This approach 
yielded precisely the advantages that 
one hopes for under such conditions: we 
were able to rapidly prototype a system 
that performed circuit fault isolation, 
focusing our attention on acquisition and 
understanding of knowledge in the problem 
domain rather than expending energy on 
the writing of software tools and 
facilities. By being able to frequently 
come back to the domain knowledge sources 
(senior NCOs at a major TCF, the 2045th 
Communications Group at Andrews Air Force 
Base, Washington, DC) with working 
software implementations of the knowledge 
they had given us in our previous visit, 
we were able to sustain a high level of 
enthusiasm and cooperation. 

As ETC's knowledge base grew its 
operation became slower and slower, 
however, especially during resets. Our 
analysis indicated that the generality 
and power of the ART design brought along 
much overhead that was not needed for the 
particular kinds of problems ETC had to 
deal with. Accordingly, we decided to 
reimplement the system entirely in 
ZetaLISP, the native language of the 
Symbolics machine. The resulting 
performance was entirely satisfactory; 
ETC moved along at about the rate at 
which a skilled human operator would 
reason about the problem in hand. 
Development and extension of the system 
knowledge base continued steadily 
thereafter until about mid-FY88, when it 
was decided to suspend further development 
of ETC and transfer our efforts 
entirely to an advanced project called 
MITEC (Machine-Intelligent Tech 
Controller), described below. 

3. ETC FUNCTIONALITY 

1,000 or more circuits pass through 
a typical TCF, where the Tech Controllers 
can access them for test, patch and 
re-route purposes via banks of manual 
patch panels. Having been gradually 
implemented over a number of years, these 
circuits involve a potpourri of equipment 
types and vintages, and the acquisition 
of fault isolation skills for all the 
commonly used variations can require 
years of practice. 

The basic mission of Tech 
Controllers is to ensure that the 
circuits passing through their TCF are 
continually available and operating at 
peak efficiency. Whenever a circuit 
outage occurs, Tech Controllers rapidly 
isolate the faulty segment and patch 
around it with spare facilities, or with 
facilities preempted from lower- 
precedence circuits, to restore service. 
At this point a repair order is issued 
for the appropriate repair service to 
find and fix the specific equipment 
failure. During a normal working day at 
the Andrews Air Force Base TCF there are 
typically several of these circuit outage 
problems in progress at once. On a less 
urgent basis, Tech Controllers endeavor 
to minimize failures by performing 
routine quality assurance testing on 
working circuits, in order to identify 
and correct incipient problems before 
they cause outages. The workload resulting from 
all these duties is substantial, and the 
actual personnel complement on board at 
a TCF is typically somewhat below the 
authorized level; moreover, two-thirds 
of these are likely to be trainees. 
Consequently there is great interest in 
the possible application of Expert 
Systems techniques for raising personnel 
work efficiency by allowing them to work 
at higher skill levels. 

The initial objective for the ETC 
development was to create a concept 
demonstration model showing how an Expert 
System could perform in the TCF environ- 
ment. The design goal for ETC was the 
capability to isolate the causes of the 
majority of the normal kinds of problems 
(e.g., no signal, receiving garble, 
excessive retransmissions) on the many 
types of circuits (e.g., voice, data, 
teletype, digital) using the wide variety 
of communication links (e.g., land lines, 
HF radio, microwave, satellite) 
between TCFs. The initial concept 
assumed an "air gap" between the 
problem-solving logic and the communications 
equipment: ETC directed the measurement 
and data-gathering actions of a human 
operator via CRT displays, and the human 
supplied the requested information to ETC 
via mouse and keyboard. The problem- 
solving logic applied by ETC reflected 
knowledge obtained from skilled human 
practitioners. Graphics displays and text 
messages aided the novice operator in 
visualizing and understanding the fault 
isolation processes. Upon isolating the 
faulty circuit segment, ETC would supply 
instructions for the operator on how to 
patch around it. Finally ETC would 
complete the paperwork items required of 
human operators under normal circum- 
stances, namely a log of the isolation 
and restoral procedures completed and a 
repair order for fixing the faulty 
equipment. 
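
The interaction described above can be sketched roughly as follows.
This is a minimal illustration only, not the ETC knowledge base: the
paper does not give ETC's actual isolation logic, so a simple
source-to-destination walk over the circuit segments is assumed, and
the segment names, the spare table and the function names
(isolate_fault, restoral_instruction) are hypothetical. The callback
signal_ok_after stands in for the human operator making a requested
measurement and reporting the result.

```python
from typing import Callable, Dict, List, Optional


def isolate_fault(segments: List[str],
                  signal_ok_after: Callable[[str], bool]) -> Optional[str]:
    """Return the first segment whose output measurement fails, or None.

    `segments` is ordered from the signal source to the far end.
    `signal_ok_after(segment)` stands in for the operator's measurement:
    True if the signal is still good at that segment's output.
    """
    for segment in segments:
        if not signal_ok_after(segment):
            return segment
    return None


def restoral_instruction(faulty: str, spares: Dict[str, str]) -> str:
    """Produce patch instructions plus a repair-order reminder."""
    spare = spares.get(faulty)
    if spare is None:
        return f"No spare listed for {faulty}; issue repair order only."
    return f"Patch around {faulty} using {spare}; issue repair order for {faulty}."


# Hypothetical circuit and operator readings (assumed for illustration).
circuit = ["modem A", "multiplexer A", "microwave link",
           "multiplexer B", "modem B"]
readings = {"modem A": True, "multiplexer A": True, "microwave link": False}

faulty = isolate_fault(circuit, lambda s: readings.get(s, False))
message = restoral_instruction(faulty, {"microwave link": "spare land line 7"})
```

In this sketch the operator is asked for measurements only until the
first failing segment is found, which mirrors the "air gap" arrangement
in which the software reasons and the human measures.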

By early FY88 the indications were 
that the concept demonstration goals of 
ETC had indeed been achieved. It was 
estimated that ETC's knowledge base 
encompassed the circuit, device and 
fault types involved in more than half 
of the normal daily work load at Andrews. 
Planning was begun for a follow-on system 
development that would "close the air 
gap" by allowing ETC to directly access 
the communications and test equipment, 
find the fault, and electronically patch 
around it. This system would exploit 
modern remotely-controllable communi- 
cations, access and test equipment 
typical of that in use in the commercial 
telecommunications industry, and 
gradually being installed in military 
facilities. This new system, called 
MITEC (Machine-Intelligent Tech 
Controller), would be targeted for 
introduction in the field on the same 
time scale as the modern communications 
and test equipment. 

At the present time, further 
development of ETC has been suspended 
and the design of MITEC is in progress. 

A communications testbed is being 
assembled to serve as a development and 
demonstration environment for 
MITEC. This testbed represents two 
modern TCFs joined by digital trunk 
circuits in the 24-channel industry 
standard 1.544 Mbps DS1 format (often 
referred to as a T1 carrier). Each TCF 
has a group of local users, and each is 
controlled by its own Expert System. The 
testbed includes voice and digital user 
terminal equipment; modems and telephone 
lines; first- and second-level 
multiplexers; a DACS-II cross-connect 
switch for T1 trunks; and HLI 3200 test 
access switches provided with HLI 3701 
and 3705 test sets. In-service and spare 
circuits are provided at several levels, 
and remotely-controlled matrix switches 
can select the desired configuration. 

The initial goals for MITEC are to 
demonstrate fault isolation and service 
restoral on all the circuit and trunk 
fault variations that are possible on 
the testbed, which will represent the 
majority of situations likely to be 
encountered at modernized TCFs. These 
processes will proceed with no air gap, 
that is, with no direct human inter- 
actions other than keeping the operator 
informed via screen messages, and giving 
the operator the go/no-go decision 
authority before a suggested circuit 
patch is actually executed. 
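
The operator's role in this closed-loop arrangement might be sketched
as below. This is an assumed control flow for illustration, not the
MITEC design: the function name run_restoral, the message texts and
the example patch string are all hypothetical. The only points taken
from the text are that the operator is kept informed by screen
messages and holds go/no-go authority before a patch is executed.

```python
from typing import Callable, List


def run_restoral(patch: str,
                 operator_approves: Callable[[str], bool],
                 apply_patch: Callable[[str], None],
                 messages: List[str]) -> bool:
    """Execute a suggested patch only if the operator gives the go-ahead."""
    messages.append(f"Fault isolated; proposed patch: {patch}")
    if not operator_approves(patch):
        messages.append("Operator declined; no patch executed.")
        return False
    apply_patch(patch)  # stands in for commands to remotely controlled equipment
    messages.append("Patch executed.")
    return True


# Hypothetical usage: the operator approves the suggested patch.
executed: List[str] = []
log: List[str] = []
done = run_restoral("reroute circuit 12 via spare DS1 channel 5",
                    operator_approves=lambda p: True,
                    apply_patch=executed.append,
                    messages=log)
```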

4. SIMULATION AND EXPERT SYSTEMS FOR 
DSN NETWORK MANAGEMENT 

The DSN (Defense Switched Network) 
is currently being implemented as a 
modern replacement for the AUTOVON 
(Automatic Voice Network) system that 
was originated in the 1950s to provide 
reliable voice service for military 
commanders in CONUS and overseas. The 
service concept in both cases is direct- 
dial long-distance service between 
authorized telephones on military 
installations, carried on government- 
leased or -owned trunk circuits (which 
are a major category, by the way, of the 
dedicated circuits handled by the Tech 
Control Facilities discussed above). 

A key feature of the system is a 
five-level precedence and preemption 
structure (Routine, Priority, Immediate, 
Flash, and Flash Override), in which a 
call being placed by a higher-precedence 
user can automatically preempt lower- 
precedence calls in progress if 
necessary. Another feature of the system 
is that it is engineered to provide good 
service (i.e., low blocking probability) 
for precedence users at the lowest 
possible cost. Basically, this means 
that the number of expensive long- 
distance trunks between pairs of 
switching nodes is made as small as 
possible. Routine users, who generate at 
least two-thirds of the normal peacetime 
traffic, therefore get significantly 
higher blocking probability on AUTOVON/ 
DSN than civilian customers experience on 
the commercial networks. 
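
The effect of precedence and preemption on a sparsely provisioned
trunk group can be illustrated with a small sketch. The five
precedence levels are those named above; everything else (the
TrunkGroup class, the rule of preempting the lowest-precedence call
in progress, and the result strings) is an assumption made for
illustration and is not taken from the AUTOVON or DSN switch
specifications.

```python
from enum import IntEnum
from typing import List


class Precedence(IntEnum):
    ROUTINE = 0
    PRIORITY = 1
    IMMEDIATE = 2
    FLASH = 3
    FLASH_OVERRIDE = 4


class TrunkGroup:
    """A group of trunks between two switching nodes (illustrative model)."""

    def __init__(self, trunks: int) -> None:
        self.trunks = trunks
        self.active: List[Precedence] = []  # precedence of each call in progress

    def offer(self, call: Precedence) -> str:
        """Return 'connected', 'preempted' (connected by preemption) or 'blocked'."""
        if len(self.active) < self.trunks:
            self.active.append(call)
            return "connected"
        # All trunks busy: preempt the lowest-precedence call if it ranks below this one.
        if self.active and min(self.active) < call:
            self.active.remove(min(self.active))
            self.active.append(call)
            return "preempted"
        return "blocked"


# Hypothetical two-trunk group offered four calls in sequence.
group = TrunkGroup(trunks=2)
results = [group.offer(p) for p in (Precedence.ROUTINE, Precedence.PRIORITY,
                                    Precedence.ROUTINE, Precedence.FLASH)]
```

In this run the second Routine call is blocked because both trunks are
busy, while the Flash call gets through by preempting the Routine call
in progress, which is the behavior the paragraph above describes for
Routine and precedence users respectively.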

The baseline requirement for 
Network Management in this Spartan 
environment is to do the best possible 
job of providing non-blocking service to 
precedence users in the face of traffic 
overloads, equipment failures or other 
disrupting influences. In the antiquated 
AUTOVON system the provisions for network 
management were minimal; in some cases 
certain manual actions were possible 
(such as re-programming switches to block 
calls to a failed switch), but for the 
most part the response to network 
problems was to dispatch repair crews and 
wait for service to be restored. A key 
feature of the DSN program is the 
replacement of the aging, limited AUTOVON 
switches with modern computer-controlled 
equipment that offers far greater Network 
Management power and flexibility. Two 
immediate problems arise in seeking to 
take advantage of this power: 1) there 
is no pre-existing body of DSN Network 
Management knowledge, and 2) it appears 
that the tasks of the DSN Network Manager 
will be complex and demanding, creating 
serious manpower training and retention 
needs. 

The ongoing program in DSN Network 
Management at Lincoln Laboratory [5] has 
two main thrusts in addressing these 
problems: creating NM knowledge by 
experimentation with a powerful DSN 
simulation, and embedding this knowledge 
in an Expert System capable of advising 
less-skilled human operators as well as 
retaining a corporate memory of NM 
knowledge through personnel transfers and 
returns to civilian life. 

5. THE CALL-BY-CALL SIMULATOR (CCSIM) 

A large Fortran program has been 
developed which simulates all the DSN 
activities relevant to NM on a call-by- 
call basis. Its host computer is a Sun 
3/260 work station, which typically 
simulates faster than real time 
(depending upon the size and complexity 
of the network being simulated). A 
CCSIM run is initialized with the 
topology and connectivity of the network 
under study, typically all the backbone 
switching nodes in a theater-wide DSN 
(i.e., Pacific or European), and is also 
provided with the matrix of average 
busy-hour routine and precedence traffic 
levels for each source/destination pair 
in the network. Random number generators 
initiate calls in accordance with 
statistical models of caller behavior, 
with averages matching the given 
matrices. Every event associated with 
each call is modelled, including all 
route selection processing, blocking and 
preemption events at source, destination 
and intermediate nodes, and user retry 
behavior. As the simulation progresses 
the experimenter can apply overload and 
fault conditions, and he can select and 
apply network management control actions 
from an available repertoire which, in 
the real world, would be transmitted to 
switch control computers throughout the 
network and would cause modifications in 
the way switches process subsequent 
calls. 
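
The call-by-call principle can be illustrated with a very small
event-driven sketch. This is not CCSIM (which is a large Fortran
program modelling routing, preemption and retries across a whole
network); it models a single trunk group only, and it assumes Poisson
call arrivals, exponentially distributed holding times and no user
retries. The function name and all parameter values are hypothetical.

```python
import heapq
import random


def simulate_trunk_group(trunks, calls_per_hour, mean_hold_minutes,
                         hours, seed=1):
    """Simulate individual calls offered to one trunk group.

    Returns counts of offered, carried and blocked calls.
    """
    rng = random.Random(seed)            # seeded for repeatable runs
    arrival_rate = calls_per_hour / 60.0  # calls per minute
    end = hours * 60.0
    now = 0.0
    departures = []                      # min-heap of end times of calls in progress
    offered = carried = blocked = 0
    while True:
        now += rng.expovariate(arrival_rate)  # time of next call attempt
        if now >= end:
            break
        # Release trunks whose calls ended before this attempt.
        while departures and departures[0] <= now:
            heapq.heappop(departures)
        offered += 1
        if len(departures) < trunks:
            carried += 1
            hold = rng.expovariate(1.0 / mean_hold_minutes)
            heapq.heappush(departures, now + hold)
        else:
            blocked += 1
    return {"offered": offered, "carried": carried, "blocked": blocked}


# Hypothetical busy-hour load: 6 erlangs offered to 5 trunks for 10 hours.
stats = simulate_trunk_group(trunks=5, calls_per_hour=120,
                             mean_hold_minutes=3, hours=10)
```

Because every call is tracked individually, fault conditions could be
imitated in such a sketch by changing the number of trunks or the
arrival rate partway through a run.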

CCSIM produces two classes of 
output information: concise local 
statistics reports for each network 
node, identical to the 5-minute 
summaries continually transmitted to 
central authority by real switches in 
the field, and exhaustive reports of the 
details of the simulation run. The 
former constitute a set of "soda straw" 
views of the network that will be the 
only statistics information available to 
DSN network management personnel at 
theater headquarters (the Area 
Communications Operations Center or 
ACOC), while the latter provides the 
experimenter with omniscient under- 
standing of what really happened during 
the run. Such precise and complete 
information is obtainable because CCSIM 
actually tracks every simulated call 
through its complete history, from birth 
to death. 
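
The relationship between the two classes of output can be sketched as
an aggregation step: detailed per-call events are reduced to per-node
counts for each 5-minute interval. The event tuple layout, the event
kinds and the node names below are assumptions for illustration; the
actual fields of the DSN switch reports are not given in this paper.

```python
from collections import defaultdict


def five_minute_reports(events, interval_minutes=5.0):
    """Reduce per-call events to per-node interval summaries.

    `events` is an iterable of (time_minutes, node, kind) tuples.
    Returns {(interval_index, node): {kind: count}}.
    """
    reports = defaultdict(lambda: defaultdict(int))
    for time_minutes, node, kind in events:
        interval = int(time_minutes // interval_minutes)
        reports[(interval, node)][kind] += 1
    return {key: dict(counts) for key, counts in reports.items()}


# Hypothetical detailed event log from a simulation run.
events = [
    (0.5, "A", "attempt"),
    (1.2, "A", "blocked"),
    (4.9, "B", "attempt"),
    (5.0, "A", "attempt"),
]
reports = five_minute_reports(events)
```

The summaries discard the identity and history of individual calls,
which is why they give only a "soda straw" view compared with the full
event record available to the experimenter.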

The process of Knowledge 
Engineering that is currently ongoing 
with CCSIM involves the initiation of 
specific damage or overload events 
during a run, followed by analysis of 
the switch reports to discern patterns 
and indicators that a network manager at 
the ACOC could have recognized as 
evidence of the existence of the par- 
ticular fault condition. The experi- 
menter then chooses a candidate NM 
control command and applies it to CCSIM, 
analyzing both the switch reports and the 
comprehensive statistics for indications 
that the control is successfully 
minimizing performance degradation caused 
by the fault condition. This knowledge 
engineering process is quite painstaking, 
involving many repetitious experiments to 
achieve statistical regularity as well as 
to understand the effects of parameter 
variations in both the damage and control 
commands. Preliminary results of this 
experimentation are described in [5]. 
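
The need for repeated runs can be illustrated by a simple comparison
of replicated experiments with and without a candidate control. This
is an assumed, deliberately crude procedure (difference of means
against combined standard errors), not the analysis method used in
[5], and the blocking fractions below are made-up numbers for
illustration only, not results from CCSIM.

```python
import statistics


def summarize_replications(blocking_fractions):
    """Return (mean, standard error of the mean) over replicated runs."""
    n = len(blocking_fractions)
    mean = statistics.mean(blocking_fractions)
    stderr = statistics.stdev(blocking_fractions) / n ** 0.5
    return mean, stderr


def control_helps(baseline, controlled, z=2.0):
    """Crude check: does the control reduce mean blocking by more than
    z combined standard errors?"""
    mean_b, se_b = summarize_replications(baseline)
    mean_c, se_c = summarize_replications(controlled)
    return (mean_b - mean_c) > z * (se_b ** 2 + se_c ** 2) ** 0.5


# Made-up blocking fractions from five replications each (illustrative only).
baseline = [0.30, 0.32, 0.29, 0.31, 0.33]
controlled = [0.20, 0.22, 0.19, 0.21, 0.23]
helps = control_helps(baseline, controlled)
```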

In the near term, a separate effort 
has been undertaken to create an ACOC 
operator training system to develop 
personnel skills to meet the immediate 
needs of manual network management of the 
DSN, which is currently in the process of 
implementation. This training system 
will use the CCSIM to produce 5-minute 
reports as inputs to the existing 
operator console and support equipment. 
A training supervisor will set up CCSIM 
runs with representative fault 
conditions, and the operators will learn 
to recognize the statistical signs of 
trouble and to select and apply control 
actions, which will then be reflected in 
the ongoing operation of CCSIM. 

6. THE NETWORK MANAGEMENT EXPERT 
SYSTEM (NMES) 

An Expert System is being 
implemented to alleviate the DSN NM 
personnel training and retention 
problems. The long-range goal of this 
effort is to aid the ACOC NM personnel 
by performing pattern recognition on the 
incoming stream of 5-minute switch 
reports to detect problem conditions, 
then to recommend NM control actions to 
overcome the problems — in short, to 
embody and make available the store of 
NM knowledge developed through experi- 
mentation with CCSIM, and also to be 
augmented over time with the accumulated 
experience of the operations personnel. 

In the near term, the NMES is being 
integrated with CCSIM to form an inter- 
active engineering tool in which the 
expert system can diagnose and correct 
problem conditions in the simulated 
network. This integrated system will be 
used in advancing the knowledge development 
for network management, as well as 
for other types of network engineering. 

The Network Management Expert 
System has been implemented with ART, 
running in a Common LISP environment on 
a Sun 3/260 work station. The nature of 
the NM problem seems better matched to 
the ART structure than was the Tech 
Control environment, and it seems likely 
that the implementation will stay in ART 
for the time being. The NM problem is 
much more a matter of scanning 
collections of slowly-varying facts to 
see whether rules are satisfied, in 
contrast with the sequential nature of 
circuit fault isolation exercises. 
Moreover, while ETC had to be reset after 
every fault diagnosis in order to clear 
its world of all the schemata that were 
created while proceeding down the various 
unsuccessful and successful lines of 
inquiry, NMES can avoid time-consuming 
resets because, once it is turned on, it 
maintains an essentially continuous view 
of its problem domain. 

A group of NMES software modules 
called "monitors" process the incoming 
switch reports from CCSIM, each watching 
for a particular pattern suggested by our 
knowledge engineering activity. An 
abstract state model of the network is 
maintained, including the nature and 
location of each of the problem indi- 
cators noted by the monitors. Higher- 
level modules analyze the network state, 
postulate problem conditions, and then 
confirm or reject the conclusions over 
successive 5-minute intervals. Another 
module consults the knowledge base of NM 
control actions to correct confirmed 
problems, and sends instructions to CCSIM 
to implement the controls at all the 
switch locations specified. An 
additional module watches the effects of 
the controls applied, both to determine 
whether the controls or parameters should 
be modified, and to remove the controls 
as soon as the problem condition goes 
away. Each knowledge module in the 
expert system has its own set of monitors 
that can be turned on or off, to scan the 
switch reports for patterns of interest. 
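
The monitor and confirmation structure described above might be
sketched as follows. NMES itself is implemented as ART rules; this
plain Python version is only an illustration of the idea. The report
fields ("attempts", "blocked"), the 50 percent blocking threshold, the
two-interval confirmation rule and the switch names are all
assumptions, not values from the NMES knowledge base.

```python
from typing import Dict, List


def high_blocking_monitor(report: Dict[str, int],
                          threshold: float = 0.5) -> bool:
    """Flag a switch report whose blocked fraction reaches the threshold."""
    attempts = report.get("attempts", 0)
    return attempts > 0 and report.get("blocked", 0) / attempts >= threshold


class ProblemTracker:
    """Confirm a postulated problem only after it persists over
    several successive 5-minute reports."""

    def __init__(self, confirm_after: int = 2) -> None:
        self.confirm_after = confirm_after
        self.streak: Dict[str, int] = {}  # consecutive flagged intervals per switch

    def update(self, reports: Dict[str, Dict[str, int]]) -> List[str]:
        """Process one interval of reports; return confirmed problem switches."""
        confirmed = []
        for switch, report in reports.items():
            if high_blocking_monitor(report):
                self.streak[switch] = self.streak.get(switch, 0) + 1
            else:
                self.streak[switch] = 0  # indicator gone: hypothesis rejected
            if self.streak[switch] >= self.confirm_after:
                confirmed.append(switch)
        return sorted(confirmed)


# Hypothetical stream of three successive 5-minute report sets.
tracker = ProblemTracker(confirm_after=2)
bad = {"SW1": {"attempts": 100, "blocked": 60},
       "SW2": {"attempts": 100, "blocked": 5}}
good = {"SW1": {"attempts": 100, "blocked": 10},
        "SW2": {"attempts": 100, "blocked": 5}}
first = tracker.update(bad)
second = tracker.update(bad)
third = tracker.update(good)
```

Because the tracker keeps its state from one interval to the next, it
reflects the continuous view of the problem domain noted earlier: a
problem is confirmed only when the indicator persists, and it drops
out as soon as the indicator disappears, at which point any applied
control would be a candidate for removal.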


7. SUMMARY 

Military communications systems are 
subject to manpower skill, training and 
retention problems of a quite different 
order than their commercial counterparts, 
and are thus a fertile field for the 
development of knowledge-based software 
support systems. Moreover, commercial 
automated aid systems tend not to be 
applicable to the military problems, 
because of such differences as precedence 
and pre-emption capabilities. Two 
problem areas have been selected for 
concept validation development of expert 
system techniques addressing military 
needs, namely Technical Control and 
Network Management. Both systems have 
been developed to the point of sub- 
stantial functionality, and appear likely 
to lead to transfer of the technology 
into the field. 